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GENOMIC DATA MINING USING CLUSTERING AND LOGIC FILTERING 

CRITERIA 

PRIOR RELATED APPLICATIONS 

[1] This application claims priority to EP 02425791.7 filed on Dec. 23, 2002, the 
disclosure of which is hereby incorporated by reference. 

FIELD OF THE INVENTION 

[2] The present invention generally relates to genomic analysis. More particularly, 
it relates to a data mining system for the analysis of gene expression data that employs both 
clustering criteria and logic filtering criteria and allows clustering according to multiple 
parameters. 

BACKGROUND OF THE INVENTION 

[3] In the year 2001 the first draft of the sequence of the human genome was 
completed. As of November 2003, over 164 complete genomes had been published, including 
mouse, fruit fly, pufferfish and yeast. This wealth of knowledge provides researchers with 
fundamental tools for preventing or treating diseases, which in many cases are considered to 
be caused or exacerbated by the simultaneous action of different genes. 

[4] In traditional genomic study, a single gene was studied at a time. However, 
genomic research is nowadays directed towards the development of technologies that allow a 
parallel analysis of thousands of genes at a time. 

[5] The so-called "gene-chips" or "DNA-chips" are extraordinary tools for 
studying patterns of gene expression. Gene-chips are large arrays of nucleic acid probes, 
arranged in matrix format on a surface such as a microscope slide. Gene-chips can contain 
hundreds to hundreds-of-thousands of such probes. Thus, the devices are called "arrays" or 
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"microarrays," and with future advances in array printing even "nanoarrays" will become 
practical. 

[6] These recent advances allow researchers to measure expression levels for 
thousands of genes simultaneously, across different conditions and over time. Analysis of data 
produced by such experiments offers potential insight into gene function and regulatory 
mechanisms. However, a key step in the analysis is the detection of groups of genes that have 
similar expression patterns. The corresponding algorithmic problem is to group or "cluster" 
gene expression patterns and correlate these with a variety of different parameters, such as 
time, drug response, disease status, patient, and the like. 

[7] Modern data mining technology can handle all three primary learning tasks: 
classification, regression, and clustering. However, clustering is used most commonly in the 
data mining of genomic information and several clustering algorithms are available. 

[8] In the general parlance of bioinformatics a clustering problem consists of n 
elements and a characteristic vector for each element. A measure of similarity is defined 
between pairs of such vectors. In gene expression, elements will be genes, the vector of each 
gene will contain its expression levels under each of the conditions, and similarity can be 
measured, for example, by correlation coefficient between vectors. The goal is to partition the 
elements into subsets, which are called clusters, so that two criteria are satisfied: 1) 
Homogeneity - whereby elements inside a cluster are highly similar to each other; and 2) 
Separation - whereby elements from different clusters have low similarity to each other. 

[9] There is a very rich literature on cluster analysis going back over three decades. 
Several algorithmic techniques have been used in clustering gene expression data, including 
hierarchical clustering, such as Cluster Identification via Connectivity Kernals (CLICK) or the 
divisive hierarchical algorithm called DIANA, model based approaches such as the Beyesion 
Infinite Mixture Model (IMM), and mixed approaches such as a finite Gaussian mixture 
model-based hierarchical clustering algorithm from MCLUST. There are also iterative 
approaches such as k-means and Cluster Affinity Search Technique (CAST). There are other 
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approaches such as simulated annealing, self organizing maps (SOM), and graph theoretic 
approaches. There are even several publicly available software packages for clustal analysis, 
including MCLUST, Vera and SAM, KNNimpute, dCHIP and the BioConductor project. 

[10] However, a limit of the known data mining techniques is that it is not possible 
to identify groups or sequences of genes by simultaneously applying a plurality of properly 
weighed criteria for grouping according to gene expression with time and a variety of specific 
properties of particular interest. 

SUMMARY OF THE INVENTION 

[11] The invention generally relates to a method of analysis of genomic information 
in order to determine relationships among genes. The method of this invention allows one to 
determine complex relationships among genes that go beyond the simple clustering operations 
of the known methods of determining which genes are co-expressed or co-regulated. 

[12] The method of this invention is applicable to a table of data relative to the 
evolution of the gene expression with time or relative to different stress conditions, and does 
not depend on the method used for obtaining the data. 

[13] First, a certain clustering algorithm is chosen and applied to the data of the 
table, obtaining sub-tables of data relative to groups of genes (clusters). Therefore, all 
possible pairing combinations of the sub-tables of data are generated and characteristic 
parameters are calculated for genes contained in these sub-tables. 

[14] Then, for each combination a characteristic value is calculated with a decision 
algorithm defined in function of these parameters, by considering the genes of the 
combination as constituting a 'Gene Network' if the characteristic value exceeds a pre-defined 
threshold. 

[15] Preferably, a certain group of logic filtering criteria of the data of the table is 
chosen, generating other sets of sub-tables of data of groups of genes that satisfy the 
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respective logic criterion. Pair combinations of the sub-tables, obtained with logic or 
clustering criteria, are calculated. 

[16] Preferably, the decision algorithm is a fuzzy logic algorithm, the antecedents 
and consequents of which are defined in function of the characteristic parameters 

[17] The method of this invention may be implemented by a relative system of 
identification of groups of co-expressed and co-regulated genes. The core of such an 
identification system is an intelligent sub-system that processes the characteristic parameters 
of groups of genes and outputs data of groups of genes identified as 'Gene Networks.' 
Preferably, this intelligent sub-system is an off-line trained, fuzzy logic processor sub-system 
structured as a neural network. 

DESCRIPTION OF THE DRAWINGS 

[18] A more complete understanding of the method and apparatus of the present 
invention may be acquired by reference to the following Detailed Description when taken in 
conjunction with the accompanying Drawings wherein: 

[19] Figure 1 is a diagram of levels of gene expression at different instants relating 
to the same sample of DNA; 

[20] Figure 2 is an interpolated diagram of the levels of gene expression at a certain 
pre-established instant, relative to different samples (Tl, T6) of DNA; 

[21] Figure 3 is an example of a report obtained carrying out a query at the website 
LocusLink. The report shows that kind of information that is already known about a 
particular gene and might be included in clustering techniques. In this example, known 
information includes that the protein is a membrane protein involved in ion transport, and a 
variety of known domains are found in the protein; 

[22] Figure 4 depicts a preferred embodiment of a system of the invention; 
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[23] Figure 5 shows sample scatter diagrams; 

[24] Figure 6 shows examples of diagrams of data correlated according to a 
quadratic law; 

[25] Figure 7 shows possible time evolutions of gene sequences; and 

[26] Figure 8 shows a set of data for training the Fuzzy system of the invention. 

DESCRIPTION OF THE INVENTION 

[27] The application makes reference to a number of Tables, the identification and 
general contents of which are as follows: 

[28] Table 1 is sample table of data relating to gene expression; 

[29] Table 2 contains a set of values of gene expression for the yeast S. cerevisiae 
in different instants; 

[30] Table 3 contains data extracted from the Saccharomyces Genome Database for 
several genes mentioned in Table 2; 

[31] Table 4 contains data relating to several genes from Table 2 that have been 
grouped in a CLUSTER; 

[32] Table 5 shows possible combinations between two groups of genes and the 
characteristic value associated with each combination; 

[33] Tables 6 and 7 show data relative to genes grouped in CLUSTERS 26 and 30; 

[34] Table 8 shows levels of gene expression of the combination among CLUSTERS 
ofTables6and 7; 

[35] Table 9 shows levels of expression of Table 8 normalized to range between 0 
and 1 ; and 
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[36] Table 10 shows values of increments of levels of expression of Table 9. 

[37] The method of the invention allows one to identify groups of genes ('Gene 
Networks') that are likely to be involved in a particular cellular process. This method is based 
on a decision algorithm that identifies groups of co-expressed or co-regulated genes using 
both clustering criteria and logic filtering criteria. 



[38] Table 1 shows a sample table of data relating to gene expression. 
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TABLE 1 

[39] From each group of genes, characteristic parameters are calculated and, with a 
decision algorithm based on these characteristic parameters, a characteristic value is 
calculated. When this characteristic value exceeds a certain pre-established threshold, the 
relative group of genes is identified as a constituent of a "Gene Network," otherwise it is 
discarded. 

[40] The remarkable advantage of the this technique is the fact that the limits of 
prior methods based exclusively on clustering are overcome, and the invention allows one to 
identify a group of genes as a "Gene Network" according to a number of differently combined 
criteria. 

[41] Preferably, the decision algorithm is a fuzzy logic algorithm configured such to 
identify correlations among genes within a large amount of data, corresponding to the time 
variable of gene expression or to different cellular conditions. 

[42] Reference and brief description of Figures 1-3 has already been provided. 
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[43] A schematic representation of a preferred embodiment of a system 
implementing the method of this invention is shown in Figure 4. 

[44] The system includes three sub-systems: 

1 . A pre-processing sub-system (CLUSTERING, FILTERING), that 
generates groups of tables using clustering criteria and logic 
filtering criteria; 

2. A processing sub-system (GENERATING COMBINATIONS, 
EXTRACTING CHARACTERISTICS) that generates groups of genes 
that are Gene Networks candidates, by combining pairs of sub- 
tables and extracting characteristic parameters for each 
combination of genes; and 

3. An intelligent sub-system (NEURO SYSTEM, FUZZY SYSTEM, 
THRESHOLD), trained off-line, that output groups of genes 
identified as Gene Networks. 

[45] The intelligent sub-system is preferably a fuzzy logic system that is trained off- 
line. It is capable of attributing to each candidate group of genes a characteristic value using a 
soft computing decision making algorithm (see, e.g., reference (22)). If this characteristic 
value exceeds a pre-established THRESHOLD, the relative group of genes is identified as 
constituting a Gene Network. 

[46] Clustering and Filtering: The pre-processing sub-system generates similar 
groups of gene sequences using clustering criteria and logic filtering criteria. There are plenty 
of clustering criteria in literature, including but not limited to those cited herein: 

1. Hierarchical Agglomerative; 

2. Non hierarchical K-means; 

3. Hierarchical sequential K-means; 

4. Non hierarchical SOM; and 

5. Non exclusive Fuzzy Clustering. 

[47] For each gene, m values of gene expression are reported, relating to m 
experiments carried out at different instants or under different conditions. The system 
generates a certain number of groups of genes (CLUSTERS) according to the criterion used and 
the initial settings chosen for the processing. 
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[48] In order to study Gene Networks, it is desirable to select groups of genes that 
present characteristics in addition to their similarity of time expression profiles. This is made 
possible through filtering techniques that select groups of genes, depending on the value 
assumed by one or more attributes of the gene itself. 

[49] The criteria to be used must be chosen as appropriate for grouping similar gene 
sequences. For example, when the influence of extended groups of genes among them and 
towards single genes must be verified, it is preferable to use a stringent logic filtering criterion 
and clustering algorithms that generate extended groups of CLUSTERS. Stringent logic filtering 
refers to a logic filtering criterion defined by plural logic conditions to be simultaneously 
satisfied and by virtue of which is likely to select only a relatively small number of genes. 
Thus, for identifying a group of correlated genes in a very numerous group of genes, or for 
finding genes belonging to a numerous group of genes that are correlated to one specific gene, 
it is desirable to use a clustering algorithm likely to produce a numerous group of genes, in 
combination with a "stringent" logic filtering for identifying a relatively small group of 
correlated genes. Groups of genes that may be correlated are identified by performing a 
clustering operation in combination with a filtering operation. Such a choice may consist of a 
single linkage hierarchical method coupled to the metric used for updating the matrix of 
distances. 

[50] A filtering criterion is any logic criterion that an user may impose on the data. 
For example, a filtering criterion may consist of selecting all the genes whose level of 
expression exceeds a certain value at the beginning of the experiment. Another filtering 
criterion may be that of considering those genes whose expression level changes on challenge 
with a test drug, and the like. 

[51] Generation of combinations and composition of groups of gene candidates to 
be a Gene Network: Let us suppose we have generated K sub-tables of genes (CLUSTER) with 
a certain clustering criterion and M sub-tables of genes (FILTER) with certain logic filtering 
criteria. According to the method of this invention, all possible groups of genes are generated 
by combining in pairs the sub-tables: 
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1 



K) K\ _ K*(K-\) 



CLUSTER-CLUSTER combinations; 



2 J (K-2)*2\ 2 



2 



r M\ _ M\ _ M*(M-1) 
< 2j"(M-2)*2!" 2 



FILTER-FILTER combinations; and 



3 



AT*M CLUSTER-FILTER combinations. 



[52] Preferably, among these combinations, the ones that generate groups of genes 
with a number of genes smaller than a certain pre-established threshold are discarded together 
with the combinations that generate groups of genes already obtained in a previous 
combination. 

[53] Each gene of the combination may be labeled with a string that indicates the 
group it belongs to. For example, a gene is labeled C2 if the group it belongs to is the cluster 
2. In a combination FILTER-FILTER or CLUSTER-FILTER, a gene present in both groups is 
labeled FjFj or QFj, where the subscripts i and j are the indexes of the CLUSTER or the FILTER 
the gene is coming from. 

[54] When the aim is to determine how the behavior of a certain gene influences a 
whole CLUSTER, it is preferable to generate combinations among a FILTER constituted only by 
that gene and groups (CLUSTERS) constituted by more genes. 

[55] Extraction of characteristics: The most significant phase is the extraction of 
characteristics because they indicate the type of correlations that must be identified. 
According to an innovative feature of the present invention, numerical parameters, tied to the 
gene expression profile, and parameters that, in contrast have a semantic meaning are both 
used. Mixed parameters that are combination of both elements may also be used. 

[56] A sample semantic parameter that is considered in Gene Network analysis is 
the percentage of genes of the combination with the same functional domain. A number 
ranging from 0 to 1 is associated to this percentage. When the value of the parameter is 1, all 
the genes of the considered combination have the same functional domain. When the value is 
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null, the genes do not have any common domain, while in all other cases the value of this 
parameter is comprised between 0 and 1 . 

[57] Similarly, another semantic parameter relates to the percentage of genes that 
have ontologies belonging to the same category. It is evident that other semantic parameters 
could be considered by extending this analysis to other semantic characteristics of gene 
sequences. It is emphasized that these parameters refer to semantic characteristics, but are 
expressed in numerical form. 

[58] According to one effective embodiment of this invention, six numeric 
parameters PI, P6 are used. Every parameter ranges between 0 and 1. 

[59] The first parameter PI is the absolute value of the linear correlation coefficient 
among the expressions of pairs of genes of the same combination if the correlation is positive, 
otherwise it is null. The second parameter P2 is analogous to PI, but is null if the linear 
correlation is positive. The third parameter P3 indicates the value of the quadratic correlation 
of the combination. The more the value of the correlation is close to 1, the more the genes of 
the combination are correlated. 

[60] The fourth parameter P4 indicates the percentage of genes of the group whose 
value of final gene expression (that is the last attribute of the gene) is greater or smaller than 
the initial value of gene expression (first attribute). In practice, the percentage of genes that 
have the same global variation is calculated. 

[61] The fifth parameter P5 indicates the percentage of genes of the group that has 
the same time evolution (increasing or decreasing). Finally, the last parameter P6 indicates 
the percentage of genes that have peak expression at the same time. 

[62] These parameters are used to verify whether the cluster contains differently 
expressed genes that participate in the same cellular process and so whether the relationships 
among them may be modeled by a Gene Network. 
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[63] Using the above specified six parameters in order to determine groups of co- 
expressed and co-regulated genes provides a robust method of identification, capable of multi- 
objective discrimination. 

[64] It must be remembered that although the approach is modeled with the six 
parameters described above, it may be generalized to any parameter of interest that expresses 
a correlation of any kind. Moreover, it is possible to use parameters that may have a 
completely semantic biological meaning or more complex mixed parameters that express at 
the same time a numerical and a semantic correlation. In the latter case, public databases exist 
to which a query may be submitted for obtaining a numerical codification that expresses the 
eventual semantic correlation. 

[65] A detailed examination of the six proposed parameters is examined hereinafter. 

[66] Parameters relating to correlation (PI, P2, P3): The correlation indicates the 
level of relationship among genes. Through these parameters, it is possible to determine how 
a linear equation or any other equation is appropriate to describe or explain such a 
relationship. 

[67] Given that X and Y are two time profiles of gene expression, it is possible to 
make a scatter diagram in a system of Cartesian coordinates. Should all the points of the 
scatter diagram lay around a straight line, the correlation is linear. In this case the equation 
that ties the two variables is a linear equation: 

Y=a+bX (1) 

[68] If Y increases when X increases, the correlation is said to be positive or direct. 
If Y decreases when X increases, the correlation is said to be negative or inverse. If there is 
not any linear relation between the two sequences, they are said to be uncorrelated. The 
degree of linear correlation between two gene sequences is given by the linear correlation 
coefficient defined as follows: 
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Y(X-X)(Y-Y) 

wherein the sum goes from 1 to m (being m the number of levels of expression calculated for 

- Y x — YJ 

each gene) and X - — — and Y - — — are the mean values. 

m m 

[69] The linear correlation is maximum when the absolute value of the coefficient p 
is equal to 1 (the sign depends on the fact that a variable increases or decreases when the other 
variable increases). 

[70] Figure 5 illustrates examples of scatter diagrams. A null value of the linear 
correlation coefficient implies only the absence of a linear correlation, nevertheless two 
sequences may be strongly dependent from each other while not presenting a strong linear 
correlation. A typical case is that of points distributed along the circumference of a circle. 

[71] Sometimes, the correlation between two gene sequences may be of a quadratic 
type, that is the relationship between X and Y is the equation of a parabola: 

Y=a+bX+cX 2 (3) 

wherein a is a constant, b is the linear growth coefficient and c is tied to the curvature and is 
due to the relation between Y and the square power of X. Figure 6 shows examples of 
quadratic correlation. 

[72] In general, whichever the relation between X and Y is, the correlation 
coefficient is defined as: 

Wkstim-tf 
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wherein Ystim is the interpolated value of Y obtained by the method of minimum squares. It 
is worth noticing that r is a non dimensional quantity, that is it does not depend on the unity of 
measure. When the relation between X and Y is linear, r coincides with the linear correlation 
coefficient, otherwise it has a more general meaning. Moreover, in the case in which the 
relation is linear, 

rxY = ryx 

that is the quantity r is the same irrespectively from the fact that X or Y is the independent 
variable. In general: 

rxY^ryx 

[73] As said before, the first three extracted parameters relate to the linear and to the 
quadratic correlation. Let us consider one of the generated combinations and let us assume 
that the corresponding group be constituted by n genes. When a number of gene sequences 
greater than two is considered, instead of the linear correlation coefficient, a linear correlation 
matrix R, defined as follows, is considered: 

1 Pn - An 
R= P » 1 - (5) 

Pnl Pn2 - 1 

#j being the correlation coefficient between the sequences of the gene i with the gene j. Of 
course, the correlation coefficient of a gene sequence with itself is 1, that is #j=l for each i=l 5 
..,n. 

[74] If for each i^j is #j=0, the n gene sequences are uncorrelated. In this case the 
determinant of the matrix R is 1, while in general it ranges between 0 and 1 . Considering that 
#j=l and that Aj = Pji> the number of calculated coefficients is: 
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ri\ _ n\ _ n*(n-\) 



(6) 



2 J (w-2)*2! 2 



[75] Once these values are calculated, the interval from 0 to 1 is divided in sub- 
intervals, for example in five equal sub-intervals of amplitude 0.2 and the number of 
coefficients comprised in each sub-interval is counted. Moreover, to each sub-interval a 
correlation value is associated, for example equal to 0.1, 0.3, 0.5, 0.7, 0.9, respectively. If one 
of the five sub-intervals contains a number of coefficients greater than 50% of the total 
number of coefficients, the value of the first parameter is the correlation value corresponding 
to that interval. On the contrary, in the case in which the coefficients are distributed mainly 
into two intervals, the value of the first parameter is the arithmetical mean of the correlation 
values of these two intervals. 

[76] By assuming for example that ni coefficients are in the sub-interval to which is 
associated a correlation value v i? n 2 coefficients are in the sub-interval with correlation value 
v 2 , and (ni+n 2 )>50% of the total number of coefficients distributed in the five intervals, the 
value assigned to the first parameter PI will be given by: 



[77] Finally, in the case in which the majority of coefficients is distributed in more 
than two intervals, the first parameter PI is the mean value of all the coefficients. The first 
parameter is calculated considering only the coefficients pij>0. 

[78] The second parameter P2, relating to the negative linear correlation, is similarly 
calculated but considering the coefficients y9y<0 and dividing the interval from -1 to 0 in five 
equal intervals. 

[79] Referring to the calculation of the third parameter P3, the correlation 
coefficients are calculated by considering the most general form of the correlation coefficient 



Pl = 



(7) 
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given in eq. (4). Considering that rxY^ryx, the number of coefficients to be calculated, in the 
case of a combination with n gene sequences, is: 

n*(n-l) (8) 

[80] For calculating r it is necessary to know Ystim, that is the interpolated value of 
Y by means of the method of minimum squares. The minimum squares parabola interpolating 
the set of points (Xi, Yi), with i=l, n is given by eq. (3): 

Y-a+bX+cX 2 (3) 

wherein a, b and c are determined by solving the following three equations: 

<Y j XY = aJ^X + bY j X 2 +cXX 3 (9) 

known as the canonical equations of the minimum squares parabola. 

[81] Given that the values of the constants are known, by substituting them in eq. 
(3), the value of Ystim and thus the value of r are calculated. 

[82] To the third parameter is attributed the mean value of the so calculated n*(n-l) 
correlation coefficients. 

[83] It should be expected that the combinations CLUSTER-CLUSTER have relatively 
high correlation values, because the clustering criteria select groups of genes with a high level 
of correlation among them. Of course, even CLUSTER-FILTER and FILTER-FILTER correlations 
may have high correlation values. In general the correlation parameter provides more 
complete indications than indications obtained by using clustering criteria. 

[84] For a better understanding of this aspect, suppose for example of considering 
two gene sequences X and Y constituted by three values of time expression, X=[l; 5; 7] and 
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Y=[10; 50; 70], The relation that ties X and Y is Y=10X and so the linear correlation 
coefficient is 1. 

[85] Nevertheless, the clustering criteria do not highlight this kind of relationship. In 
fact the majority of implemented clustering techniques uses distance metrics. Two gene 
sequences with very similar values of gene expression are grouped in the same CLUSTER 
because they identify very close points in the /^-dimensional space. 

[86] By contrast, in the cited example, even if there is a linear relation between two 
sequences, they identify distant points of the space and thus, probably, do not belong to the 
same CLUSTER. The only clustering criterion that make an exception to this rule is the 
agglomerative method using Pearson's coefficient. In fact, this metric is a measure of 
similarity and not of distance and does not satisfy metric properties. 

[87] Parameters relating to the pattern of expression: The last three extracted 
parameters, P4, P5 and P6, relate to similarity among gene sequences in terms of time-variable 
or condition-variable pattern of expression. In particular, the sign of the full variation is 
considered, the type of evolution (increasing or decreasing) and the coincidence of peak 
variation at the same point of time. 

[88] The fourth parameter P4 indicates the percentage of genes that behave in a 
similar way from the point of view of the whole variation of the value of gene expression. 

[89] For each gene sequence of the examined combination, the variation between 
the value of final gene expression (that is relating to the last attribute) and the initial one 
(relating to the first attribute) is calculated, by considering preferably only the absolute value 
of the variation (independently from the sign). Being the number of gene sequences that have 
a value of final gene expression greater that the starting known one, the percentage of 
sequences that have a positive variation is calculated. The fourth parameter is a number 
comprised between zero and one, indicative of this percentage. 
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[90] In practice, the closer to 50% the percentage of genes having the same 
variation is, the closer to 0 the fourth parameter P4 is. The closer to 100% that percentage is, 
the closer to 1 the parameter P4 is, because the majority of genes of the group behaves in the 
same way. 

[91] In the case in which the percentage is small and close to 0, the value of the 
parameter is high and is close to 1 . This is due to the fact that small percentages of genes 
having a positive variation, imply high percentages of sequences with negative variation. This 
parameter is used for identifying groups of genes with a similar behavior from the point of 
view of the whole variation of the value of gene expression, independently from the sign of 
the variation. Finally, percentages of 70% or equivalently of 30% are associated to values of 
the parameter close to 0.5. 

[92] Nevertheless, it must be considered that a gene sequence with a value of final 
gene expression greater than the initial value does not necessarily increase with time. Vice 
versa, a negative variation does not necessarily imply that the expression is decreasing. In 
order to identify a Gene Network, it is important to identify genes presenting a similar time 
evolution, whether increasing or decreasing, independently from the values of single 
attributes. To illustrate this let us consider Figure 7. 

[93] The three sample gene sequences A, B, C increase with time, even if their 
evolution are completely different and the values of gene expression differ among them. 
Moreover the sequence A, even if it is increasing, has a negative variation between the final 
and the initial values. This detail is not considered by clustering criteria and it is not 
highlighted by the four parameters described above. 

[94] For this reason a fifth parameter P5 has been introduced for accounting for 
considering this characteristic. The percentage of genes that present an increasing evolution is 
calculated for each combination. In function of the obtained percentage, a value ranging from 
0 to 1 is assigned to this fifth parameter. 
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[95] The value assignment is very similar to that illustrated for the fourth parameter: 
at a very small or very large percentage value corresponds a value of the parameter close to 1, 
while a value of the parameter close to 0 is associated to a percentage close to 50%. 

[96] The sixth parameter P6 refers to gene sequences of the considered group that 
show a peak variation of the same point of time. 

[97] An external cause, such as the administering of a test drug or the changing the 
environmental conditions, such as sharp change in temperature, may cause a strong increase or 
reduction of the level of gene expression. The coincidence presence of peak variation at the 
same instant, may lead to the identification of a group of genes that react similarly in presence 
of a certain external agent. 

[98] In consideration of the fact that the values of gene expression are normalized 
and thus they range between 0 and 1, a threshold value of half the amplitude (0.5) of the 
normalization interval is considered. 

[99] The excursion peak value is calculated for each sequence belonging to the 
combination to be examined. If no gene of the combination has a peak variation that 
surpasses the threshold, a null value is assigned to the last parameter P6. If all the genes of the 
group have a "peak" (that is a maximum variation greater that the threshold) occurring at the 
same point of time, the parameter is 1. In all other cases the parameter has a value equal to 
the percentage of genes that show a peak at the same point of time. 

[100] Intelligent sub-system: The intelligent sub-system is based on Soft Computing 
methods, preferably it is a neural fuzzy system whose rules may be either: 

1 . introduced by the user by way of a programming language 
using clauses such as EF. . . THEN; or 

2. generated by means of a neural network with weights and 
thresholds representing the characteristic parameters. 



HOUSTON 296563v3 64659-00003USPX 



18 



CUSTOMER NO. 23932 



PATENT APPLICATION 
Atty. Docket No. 64659-3USPX 



[101] According to the latter alternative, the sub-system must be trained first (off-line 
learning) with an appropriate set of data (learning matrix), such as the sample set of data of 
Figure 8. 

[102] While functioning off-line, the output of the Fuzzy system (characteristic 
value) is compared with a threshold value THRESHOLD. Among the obtained group of genes, 
the groups associated to a characteristic value greater than the threshold are identified as Gene 
Network, while the other groups are discarded. 

EXAMPLE 1 

[103] For better illustrating the method of this invention, a sample application will 
now be described. The input data are constituted by levels of gene expression of certain 
sequences to be examined. Table 2 depicts a portion of the set of data used for the 
experiment. 



Genbank 


alpha 0 


alpha 7 


alpha 14 




Alpha 21 


alpha 28 


alpha 35 


YBR166C 


0.33 


-0.17 


0.04 




-0.07 


-0.09 


-0.12 


YOR357C 


-0.64 


-0.38 


-0.32 




-0.29 


-0.22 


-0.01 


YLR292C 


-0.23 


0.19 


-0.36 




0.14 


-0.4 


0.16 


YDL120W 


0.11 


0.32 


0.03 




0.32 


0.03 


-0.12 


YGL248W 


-0.25 


0.26 


0.01 




-0.06 


-0.42 


-0.07 


YIL146C 


-0.58 


-0.29 


-0.45 




-0.15 


-0.86 


-0.36 


YJR106W 


-0.36 


-0.17 


-0.22 




-0.34 


-0.36 


0.03 


YBR123C 


-0.17 


-0.32 


-0.34 




-0.42 


-0.25 


-0.3 


















YHR047C 


-0.29 


-0.07 


-0.34 




-0.34 


-0.36 


-0.43 


YMR055C 


-0.34 


0.88 


-0.42 




-0.97 


-0.15 


-0.29 


YDR457W 


0.01 


-0.69 


-0.09 




-0.09 


0.25 


0.21 



TABLE 2 



[104] In the first column of Table 2 there are the accession numbers of genes, which 
in this specific case belong to the genome of the yeast S. cerevisiae. The first letter of each 
accession number is Y which stands for "yeast." Further, the name of the experiment is 
ALPHA and is based on 18 separate measurements at different times. 
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[105] First the yeast were synchronized by withdrawal of the alpha factor. Then 
gene expression was measured at various time intervals. All the measurements were done by 
taking the value of gene expression at the instant t=0 as reference (second column of the 
table). The other columns show the levels of gene expression measured after 7 min, 14 min 
and so forth. 

[106] For each gene, identified by the corresponding accession number, it is possible 
to gather additional information, such as its description (Description), the functional category 
(Molecular Function) and the annotation (Biological Process) of the specific gene. This 
information is available in the Saccharomyces Genome Database, an example of which is 
shown in Table 3. 



GenBank 


Description 


Molecular Function 


Biological Process 


YBR166C 


TYR1 TYROSINE BIOSYNTHESIS... 


GO:8977 prephenate.. 


GO:6570 tyrosine metabolism 


YOR357C 


GRD19 SECRETION GOLGL.. 




GO:81 04 protein localization 


YLR292C 


SEC72 SECRETION ER PROTEIN... 


GO:5047 signal recognition... 


GO:6615 SRP-dependent... 


YDL120W 


YFH1 IRON HOMEOSTASIS... 


GO:5554 unknown 


GO:6879 iron homeostasis 


YGL248W 


PDE1 PURINE METABOLISM 3' ,5'... 


GO:41 15 cAMP-specific... 


GO: 19933 cA MP- mediated.. 


YIL146C 


ECM37 CELL WALL BIOGENESIS... 


GO:5554 unknown 


GO:7047 cell wall organization.. 


YJR106W 


ECM27 CELL WALL BIOGENESIS... 


GO:5554 unknown 


GO:7047 cell wall organization... 


YBR123C 


TFC1 TRANSCRIPTION... 


GO:3709 RNA polymerase III... 


GO:6384 transcription initiation... 











TABLE 3 



[107] Results obtained with a clustering criterion: 1533 yeast genes have been 
considered. Each gene is described by eighteen levels of gene expression, corresponding to 
the values measured at intervals of 7 minutes, following the ALPHA experiment (instant t=0). 

[108] These sequences have been grouped by the K-means algorithm by adopting an 
initial number of centroids equal to 50 and a random method of generating centroids. 

[109] At the end of the clustering process, 50 sub-tables (CLUSTERS) were obtained, 
equal to the number of centroids that had been initially chosen. This condition indicates that 
there were no empty clusters, that eventually would have been discarded at the end of the 
clustering process. 
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[110] Table 4 shows the content of the file kmyeast50.txt relating to the fiftieth 
CLUSTER that contains 9 gene sequences. 
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The file indicates also the accession numbers (GenBank) and the eighteen values of gene 
expression of the sequences (PARAMETERS) constituting the CLUSTER. 

[Ill] Filtering phase: According to this invention it is possible to optionally perform 
a filtering phase by which it is possible to select several of the considered gene sequences by 
function of the values of gene expression assumed at different instants. For example it is 
possible to filter all the sequences whose value of gene expression at the instant t=0 is greater 
than zero or it is possible to consider a filtering criterion relating to more than one parameter 
at the time. Nevertheless, this phase is optional and for sake of simplicity it has been waived 
for this example. 

[112] Generation of combinations: During this phase all the combinations CLUSTER- 
CLUSTER are generated. It is evident that, if the filtering phase had been performed, the 
combinations FILTER-FILTER and CLUSTER-FILTER would have been generated as well. 

[113] In this case, considering that the number of combinations is 
OH K\ _K*(K-\) 



(K-2)*2\ 



and considering that the number K of CLUSTERS generated in the previous phase is 
50, 1225 combinations have been obtained. 

[114] For each combination the above mentioned six numerical parameters PI, P6 
have been calculated. For the calculated parameters, the intelligent sub-system assigned to 
each combination a characteristic value ranging between 0 and 1. All the combinations with a 
characteristic value (degree of Gene Network), greater than a pre-established threshold have 
been identified as possible Gene Networks. In this example, a threshold value of 0.5 was set 
and six possible Gene Networks were identified. The printed file gnyeast.txt shown in Table 5 
stores information relating to the generated combinations. 
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NUMBER OF GENE- 
NETWORKS: 6 

MAX NUMBER OF ELEMENTS: 1533 DEGREE 



gnyeastl .txt 


10 


kmyeast22.txt 


kmyeast26.txt 


CLUSTER 22 


CLUSTER 26 


1 


gnyeast2.txt 


19 


kmyeast22.txt 


kmyeast34.txt 


CLUSTER 22 


CLUSTER 34 


0.91 


gnyeast3.txt 


30 


kmyeast26.txt 


kmyeast30.txt 


CLUSTER 26 


CLUSTER 30 


0.67 


gnyeast4.txt 


13 


kmyeast26.txt 


kmyeast44.txt 


CLUSTER 26 


CLUSTER 44 


0.68 


gnyeast5.txt 


10 


kmyeast26.txt 


kmyeast45.txt 


CLUSTER 26 


CLUSTER 45 


1 


gnyeast6.txt 


19 


kmyeast34.txt 


kmyeast45.txt 


CLUSTER 34 


CLUSTER 45 


0.8 


Xgnyeastl .txt 


34 


kmyeast1.txt 


kmyeast2.txt 


CLUSTER 1 


CLUSTER 2 


0 


Xgnyeast2.txt 


43 


kmyeast1.txt 


kmyeast3.txt 


CLUSTER 1 


CLUSTER 3 


0.3 


Xgnyeast3.txt 


57 


kmyeastl .txt 


kmyeast4.txt 


CLUSTER 1 


CLUSTER 4 


0 


Xgnyeast4.txt 


38 


kmyeast1.txt 


kmyeast5.txt 


CLUSTER 1 


CLUSTER 5 


0 


Xgnyeast5.txt 


54 


kmyeast1.txt 


kmyeast6.txt 


CLUSTER 1 


CLUSTER 6 


0 


Xgnyeast6.txt 


59 


kmyeastl .txt 


kmyeast7.txt 


CLUSTER 1 


CLUSTER 7 


0 

















TABLE 5 



[115] In the first column the names of files containing more detailed information on 
the generated combinations are indicated. The names of the files that start with the letter X 
refer to the combinations to which the intelligent sub-system assigned a value smaller than 
0.5. By contrast, the remaining files refer to the combinations that the system identified as 
possible Gene Networks, and which in this specific case were 6. 

[116] The second column contains the number of gene sequences constituting the 
examined combination, while the remaining columns indicate the type of combination 
(example CLUSTER22-CLUSTER26) . 

[117] The last column represents the value assigned from the previously trained 
neural fuzzy system. It is evident that the closer the assigned value is to 1, the more likely the 
examined combination is a Gene Network. Vice versa, the closer the output value is to 0, the 
more unlikely is the fact that the combination be a Gene Network. 

[118] For example looking to the third row, the combination between the CLUSTER26 
(C26), containing 9 gene sequences, and the CLUSTER30 (C30), containing 21 sequences, has a 
value of 0.67. Thus it has been recognized as a possible gene network. 
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[119] Tables 6 and 7 contain data relating to CLUSTER26 (C26) and CLUSTER30 

(C30). 
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[120] All necessary information relative to the combined set of data is reported in 
Table 8. 
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The second column of Table 8 indicates the CLUSTER, to which the gene sequence belongs, 
whose the accession number of which is written in the first column. In addition to the values 
of gene expression of the sequences that constitute the combination, the values of the six 
extracted parameters and the relative output value assigned by the system are also shown. 

[121] Setting aside the calculation of the first three parameters PI, P2 and P3 relating 
to the linear and quadratic correlation of the combination, for the sake of simplicity, the 
procedure for calculating the parameters P4, P5 and P6 is described for the combination C26- 
C30ofTable8. 

[122] Calculation of P4: In order to calculate the parameter P4, which is the 
percentage of genes whose final value is greater than the initial value, the variation A between 
the value of gene expression corresponding to the last instant alphall9 and the value of 
expression corresponding to the first instant alphaO must be considered. For the first sequence 
YPR120C the variation A is: 

A = -0.43 - (- 0.92) = 0.49 => A > 0 

[123] This calculation is performed for all the sequences of the combination. The 
result is that 21 sequences out of 30, that is 70% of the sequences have a positive variation 
(A>0). This percentage, which will be indicated with the variable "PERCENTJVALUE" 
hereinafter, is transformed in a value comprised between 0 and 1. 

[124] The transformation procedure of this variable is: 

When PERCENT_VALUE=50%, P4 is null. 

When 50%<PERCENT_VALUE<70%, P4 is a value ranging from 
0 to 0.5 (the greater the percent value, the greater the value of the 
parameter). 

When 70%<PERCENT_VALUE< 100%, P4 is a value ranging from 
0.5 to 1 (the greater the percent value, the greater the parameter). 

When 30% <PERCENT_VALUE<50%, P4 is a value ranging from 
0 to 0.5 (the smaller the percent value, the greater the value of 
the parameter). 
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When 0% <PERCENT_VALUE<30%, P4 is a value ranging from 
0.5 to 1 (the smaller the percent value, the greater the value of 
the parameter). 

[125] In practice, the closer to 50% is the percentage of genes having the same 
variation, the closer to 0 is the value of the fourth parameter P4. The closer to 100% is the 
PERCENT_VALUE, the closer to 1 is the parameter P4, because the majority of genes in the 
group have similar behavior. 

[126] In the case in which the percentage is small and close to 0, the value of the 
parameter P4 is great and is close to 1 . This is due to the fact that small percentages of genes 
having a positive variation (A>0) imply high percentages of sequences with negative variation 
(A<0). This parameter allows identification of groups of genes with a similar behavior from 
the point of view of the global variation of the value of gene expression, independently from 
the sign of the variation. Thus, for example, genes turned or on in response to a drug can be 
identified. In the considered example, given that the percent value is 70%, P4 has a value 
equal to 0.5 (see Table 8). 

[127] Calculation of P5: In order to calculate P5, which is the percentage of genes 
with the same time evolution, it must be verified whether the pattern of gene expression is 
increasing or decreasing with time. 

[128] Given that the values of gene expression are sampled, the variations A/ between 
the gene expression value corresponding to the (i) th instant and the value of the gene 
expression corresponding to the (i-l)* for i=l, 2, n being n the number of experiments, 
must be calculated. In this specific case n=18 and for each sequence n-1 (17) values are 
calculated. 

[129] For example, for the first sequence YPR120C the values of the variations Ai 

are: 

Al = alpha7-alpha0 - -0.32 - ( 0.92) = 0.6 > 0 
A2 = alphal4-alpha7 = 0.98 - (-0.32) = 1.3 > 0 

30 
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A3 = alpha21-alphal4 = 1.03-0.98 = 0.05 > 0 
A4 = alpha28-alpha21 = 0.32 -1.03 < 0 
A5 = alpha35-alpha28 = -0.03-0.32 < 0 
A6 = alpha42-alpha35 = -0.12+0.03 < 0 
A7 = alpha49-alpha42 = -0.34+0. 12 < 0 
A8 = alpha56-alpha49 = -0.29+0.34 > 0 
A9 = alpha63-alpha56 = -0.27+0.29 > 0 
A10 = alpha70-alpha63 = 0.76+0.27 > 0 
Al 1 = alpha77-alpha70 = 0.67-0.76 < 0 
A12 = alpha84-alpha77 = 0.37-0.67 < 0 
A13 = alpha91-alpha84 = -0.17-0.37 < 0 
A14 = alpha98-alpha91 - 0.16+0.17 > 0 
A15 = alphal05-alpha98 = -0.14-0.16 < 0 
A16 = alpha 1 12-alpha 105 = -0.15 +0.14 < 0 
A17 = alphal 19-alphall2 = -0.43 +0.15 < 0 

[130] If the number of positive variations Ai is greater than the number of negative 
variations Ai, the time evolution of the sequence is globally increasing, vice versa it is globally 
decreasing. 

[131] For the sequence YPR120C, the number of positive variations Ai is 7, while the 
number of negative variations Ai is 10. Given that the number of positive variations Ai is 
smaller than the number of negative variations Ai, the sequence is considered having a 
decreasing time evolution. 

[132] The same calculation is repeated for each of the remaining 29 sequences of the 
combination illustrated in Table 8, showing that a certain percentage of sequences (indicated 
hereinafter with the variable "PERCENT") has a globally increasing time evolution. To this 
percent value a parameter P5, whose value ranges between 0 and 1, is associated with a 
procedure similar to that described for the parameter P4. 
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[133] In the considered case, 5 sequences out of 30, that is 16.7% have a globally 
increasing time evolution, thus P5 is large. In fact 83.3% of sequences have a decreasing 
evolution and thus a higher percentage of sequences have similar time evolutions. 

[134] It is worth remarking that the value of parameter P5 does not depend on the 
fact that the global evolution of the majority of genes is increasing or decreasing, but on how 
many genes of the combination have the same evolution. In the considered case, the 
parameter (P5) is close to 1 . 

[135] A possible procedure for evaluating P5 is described in detail hereinafter. 

[136] Let us define three threshold values: THRESHOLD 1=0.3; THRESHOLD2= 1 - 
THRESHOLDl=0.7; THRESHOLD3=0.5. In function of the thresholds the following values 
are calculated: 

valuel=((threshold2-threshold3)/(l-threshold3))=0.4; 

threshold2=((2*threshold2-l+threshold3)/(2*threshold3))=0.9; 

If threshold2<=percent<= 1 , P5=((percent-threshold 1 )/( 1 - 
value 1)); 

If 0<=percent<=threshold 1 , P5=((( 1 -percent)-value 1 )/( 1 - 
value 1)); 

If 0.5<=percent<threshold2, P5=((percent-0.5)/(value2-0.5)); 
Ifthresholdl<percent<0.5, P5=(((l-percent)-0.5)/(value2-0.5)). 

[137] A percentage of 50% corresponds to a null value of P5 because, as previously 
stated, in this case there is not any prevalent (increasing or decreasing) time evolution of the 
sequences that constitute the combination. 

[138] In the proposed example (Table 8) the value of P5 is given by: 

P5=(((1-PERCEOT)-VAL^ 

[139] Calculation of P6: In order to calculate P6, which is the percentage of genes 
with a peak occurring at the same instant, it must be verified whether the absolute value of the 
variation Ai exceeds a certain threshold value. Given that the values relative to each 
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experiment have been normalized between 0 and 1, a value of 0.5, which is one half of the 
normalization interval, was chosen as threshold value. Table 9 contains the gene expression 
values normalized between 0 and 1 of the sequences constituting the combination C26-C30. 
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[140] For sequence YPR120C is: 

I Al I =) alpha7-alpha0 1 = 0.307 
I A2 H alphal4-alpha7 1 = 0.666 

I A17H alphall9-alphall2| =0.14359 

[141] Repeating these calculations for all the sequences, the results reported in Table 
10 are obtained. 
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[142] The values of | A | that surpass the threshold are underlined. For each sequence 
the maximum value of | A | that surpasses the threshold is the peak (maximum value) and it is 
highlighted with a continuous line perimeter. For example, for the first gene sequence 
YPR120C there are only two values greater than the threshold, |A2| = 0.66 and | A10 | = 0.52; 
the peak in this case is | A2 |. 

[143] It must be noticed that not all gene sequences necessarily show a peak. In the 
proposed example the sequences YJL115W, YCR065W, YOR074C, YKL113C, YKL076W, 
YER001W and YDR309C do not show a peak. 

[144] For calculating the sixth parameter P6, the maximum number of peaks 
occurring at the same instant must be considered. In this example, the maximum number of 
peaks is 17 and they are in the column of | A2 |. In particular, 56.7% of sequences (17 
sequences out of 30) of the combination have a coincident peak and thus in this case P6 is 
equal to 0.57, as shown in Table 8. 

[145] The following list of References is provided for reader review. The identified 
references are incorporated by reference herein. 
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[146] Although preferred embodiments of the method and apparatus of the present 
invention have been illustrated in the accompanying Drawings and described in the foregoing 
Detailed Description, it will be understood that the invention is not limited to the 
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embodiments disclosed, but is capable of numerous rearrangements, modifications and 
substitutions without departing from the spirit of the invention as set forth and defined by the 
following claims. 
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